Zipfian distribution (zipfian)#
The Zipfian distribution is a discrete power-law distribution over ranks. It is commonly used when a few items are very frequent (low ranks) and many items are rare (high ranks).
Throughout, we’ll use SciPy’s parameterization scipy.stats.zipfian(a, n), which is a finite-support (truncated) Zipf law on ranks \(\{1,\dots,n\}\).
Notebook roadmap#
Define the distribution (PMF/CDF) and connect it to Zipf’s law.
Work out moments/properties (including when moments exist in the \(n\to\infty\) limit).
Derive likelihood and a practical MLE for the exponent.
Implement NumPy-only sampling via inverse transform.
Visualize PMF/CDF and Monte Carlo behavior.
Use SciPy’s zipfian for evaluation, sampling, and fitting.
import math
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots
from scipy import stats
from scipy.optimize import minimize_scalar
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=6, suppress=True)
SEED = 7
rng = np.random.default_rng(SEED)
print("numpy:", np.__version__)
import scipy
print("scipy:", scipy.__version__)
numpy: 1.26.2
scipy: 1.15.0
1) Title & Classification#
Name: zipfian (Zipfian / truncated Zipf law)
Type: Discrete
Support (SciPy default loc=0):
\[\mathcal{S}=\{1,2,\dots,n\}\]
With a location shift loc, the support becomes \(\{\text{loc}+1,\dots,\text{loc}+n\}\).
Parameter space (SciPy):
\[a>0,\qquad n\in\{1,2,3,\dots\}\]
where:
\(a\) is the exponent (controls tail heaviness)
\(n\) is the maximum rank (finite truncation)
2) Intuition & Motivation#
What it models#
A Zipfian distribution models ranked outcomes where the probability of rank \(k\) decays approximately like a power law:
\[\Pr(X=k)\propto k^{-a}.\]
This captures “short head, long tail” behavior: rank 1 is common, rank 2 is less common, and so on, with many rare high-rank outcomes.
Typical real-world use cases#
Word frequencies (Zipf’s law): common words occur extremely often; most words are rare.
Popularity / demand: product views, song plays, page visits (ranked by popularity).
Discrete heavy tails with a hard cutoff: when only the top \(n\) ranks are possible (finite catalog / vocabulary).
Relations to other distributions#
scipy.stats.zipf: the infinite-support Zipf / zeta distribution on \(\{1,2,\dots\}\).
Pareto (continuous analogue): power law on \([x_{\min},\infty)\).
Truncation: zipfian is essentially a truncated power law; as \(n\to\infty\) with \(a>1\), it approaches the zeta distribution.
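The convergence to the zeta distribution can be checked numerically (a quick standalone sketch using SciPy; the ranks and grid of \(n\) values are arbitrary choices):

```python
import numpy as np
from scipy import stats

a = 1.5  # any a > 1 so the zeta distribution is normalizable
ks = np.arange(1, 11)
for n in (10, 100, 10_000):
    # Max absolute PMF gap on low ranks between the truncated and infinite-support laws
    gap = np.max(np.abs(stats.zipfian.pmf(ks, a, n) - stats.zipf.pmf(ks, a)))
    print(f"n={n:>6}: max |zipfian - zipf| on k<=10 = {gap:.2e}")
```

The gap shrinks as \(n\) grows, since \(H_{n,a}\to\zeta(a)\).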
3) Formal Definition#
Define the generalized harmonic number
\[H_{n,a}=\sum_{k=1}^{n} k^{-a}.\]
PMF#
For \(k\in\{1,\dots,n\}\),
\[\Pr(X=k)=\frac{k^{-a}}{H_{n,a}}.\]
(And \(\Pr(X=k)=0\) outside the support.)
CDF#
For real \(x\),
\[F(x)=\Pr(X\le x)=\begin{cases}0, & x<1,\\[2pt] \dfrac{H_{\lfloor x\rfloor,a}}{H_{n,a}}, & 1\le x<n,\\[4pt] 1, & x\ge n.\end{cases}\]
Because the distribution is discrete, the CDF is a step function.
def _validate_a_n(a, n):
a = float(a)
if not np.isfinite(a) or a <= 0.0:
raise ValueError("a must be a finite float > 0")
if isinstance(n, bool) or not isinstance(n, (int, np.integer)):
raise TypeError("n must be an integer")
n_int = int(n)
if n_int < 1:
raise ValueError("n must be >= 1")
return a, n_int
def _logsumexp_np(log_w: np.ndarray) -> float:
# NumPy-only stable log-sum-exp for 1D arrays.
log_w = np.asarray(log_w, dtype=float)
m = np.max(log_w)
if not np.isfinite(m):
return float(m)
return float(m + np.log(np.sum(np.exp(log_w - m))))
def zipfian_logZ(a: float, n: int) -> float:
# log H_{n,a} computed stably.
a, n = _validate_a_n(a, n)
ks = np.arange(1, n + 1, dtype=float)
return _logsumexp_np(-a * np.log(ks))
def zipfian_logpmf(k, a: float, n: int):
# Log PMF for the Zipfian distribution on {1,...,n}.
a, n = _validate_a_n(a, n)
k_arr = np.asarray(k)
out = np.full(k_arr.shape, -np.inf, dtype=float)
k_int = k_arr.astype(int)
valid = (k_int == k_arr) & (k_int >= 1) & (k_int <= n)
if not np.any(valid):
return out
logZ = zipfian_logZ(a, n)
out[valid] = -a * np.log(k_int[valid]) - logZ
return out
def zipfian_pmf(k, a: float, n: int):
return np.exp(zipfian_logpmf(k, a, n))
def zipfian_pmf_array(a: float, n: int):
# Return support ks and PMF values as arrays.
a, n = _validate_a_n(a, n)
ks = np.arange(1, n + 1)
logZ = zipfian_logZ(a, n)
pmf = np.exp(-a * np.log(ks) - logZ)
pmf = pmf / pmf.sum() # guard against floating drift
return ks, pmf
def zipfian_cdf(x, a: float, n: int):
# CDF as a step function using the PMF array.
a, n = _validate_a_n(a, n)
x_arr = np.asarray(x)
ks, pmf = zipfian_pmf_array(a, n)
cdf_vals = np.cumsum(pmf)
cdf_vals = np.clip(cdf_vals, 0.0, 1.0)
k = np.floor(x_arr).astype(int)
out = np.zeros_like(x_arr, dtype=float)
out[x_arr >= n] = 1.0
inside = (x_arr >= 1) & (x_arr < n)
if np.any(inside):
out[inside] = cdf_vals[k[inside] - 1]
return out
# Quick sanity check
a, n = 1.3, 20
ks, pmf = zipfian_pmf_array(a, n)
print("sum pmf:", pmf.sum())
print("cdf(n):", zipfian_cdf(n, a, n))
sum pmf: 0.9999999999999998
cdf(n): 1.0
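As a further sanity check, the direct normalization \(p_k=k^{-a}/H_{n,a}\) should agree with SciPy’s reference implementation to floating-point precision (a standalone sketch, not relying on the helpers above):

```python
import numpy as np
from scipy import stats

a, n = 1.3, 20
ks = np.arange(1, n + 1)
# Direct normalization: p_k = k^{-a} / H_{n,a}
pmf_manual = ks.astype(float) ** -a
pmf_manual /= pmf_manual.sum()
max_gap = np.max(np.abs(pmf_manual - stats.zipfian.pmf(ks, a, n)))
print("max |manual - scipy| PMF gap:", max_gap)
```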
4) Moments & Properties#
A convenient way to express moments uses the generalized harmonic numbers.
Raw moments#
For integer \(m\ge 0\),
\[\mathbb{E}[X^m]=\sum_{k=1}^{n} k^m\,\frac{k^{-a}}{H_{n,a}}=\frac{H_{n,a-m}}{H_{n,a}}.\]
In particular, \(\mathbb{E}[X]=H_{n,a-1}/H_{n,a}\) and \(\mathbb{E}[X^2]=H_{n,a-2}/H_{n,a}\).
Mean and variance#
\[\mathbb{E}[X]=\frac{H_{n,a-1}}{H_{n,a}},\qquad \mathrm{Var}(X)=\frac{H_{n,a-2}}{H_{n,a}}-\left(\frac{H_{n,a-1}}{H_{n,a}}\right)^2.\]
Because zipfian has finite support, all moments exist for any \(a>0\). However, in the limit \(n\to\infty\) (the zeta distribution), normalization requires \(a>1\) and higher moments require progressively larger \(a\).
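The harmonic-number expressions for the mean and variance can be verified directly against a brute-force sum over the support (a standalone sketch; the parameter values match the demo below):

```python
import numpy as np

def H(n, s):
    # Generalized harmonic number H_{n,s} = sum_{k=1}^n k^{-s}
    k = np.arange(1, n + 1, dtype=float)
    return float(np.sum(k ** -s))

a, n = 1.2, 50
k = np.arange(1, n + 1, dtype=float)
p = k ** -a / H(n, a)
# Brute-force moments from the PMF
mean_direct = float(np.sum(k * p))
var_direct = float(np.sum(k**2 * p)) - mean_direct**2
# Closed forms via harmonic-number ratios
mean_formula = H(n, a - 1) / H(n, a)
var_formula = H(n, a - 2) / H(n, a) - mean_formula**2
print(mean_direct, mean_formula)
print(var_direct, var_formula)
```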
MGF and characteristic function#
With finite support, the MGF and characteristic function exist for all real \(t\):
\[M_X(t)=\mathbb{E}\!\left[e^{tX}\right]=\sum_{k=1}^{n} e^{tk}\,\frac{k^{-a}}{H_{n,a}},\qquad \varphi_X(t)=\sum_{k=1}^{n} e^{itk}\,\frac{k^{-a}}{H_{n,a}}.\]
Entropy#
Using \(\log p_k=-a\log k-\log H_{n,a}\), we can rewrite the Shannon entropy as
\[H(X)=-\sum_{k=1}^{n} p_k\log p_k=\log H_{n,a}+a\,\mathbb{E}[\log X].\]
def zipfian_raw_moment(m: int, a: float, n: int) -> float:
a, n = _validate_a_n(a, n)
ks = np.arange(1, n + 1, dtype=float)
logk = np.log(ks)
logZ = zipfian_logZ(a, n)
# E[X^m] = sum exp((m-a) log k - logZ)
return float(np.exp(_logsumexp_np((m - a) * logk) - logZ))
def zipfian_moments(a: float, n: int):
a, n = _validate_a_n(a, n)
m1 = zipfian_raw_moment(1, a, n)
m2 = zipfian_raw_moment(2, a, n)
m3 = zipfian_raw_moment(3, a, n)
m4 = zipfian_raw_moment(4, a, n)
var = m2 - m1**2
if var <= 0.0:
skew = float("nan")
excess_kurt = float("nan")
else:
mu3 = m3 - 3.0 * m1 * m2 + 2.0 * m1**3
mu4 = m4 - 4.0 * m1 * m3 + 6.0 * (m1**2) * m2 - 3.0 * m1**4
skew = mu3 / (var ** 1.5)
excess_kurt = mu4 / (var**2) - 3.0
return {
"mean": m1,
"var": var,
"skew": skew,
"kurtosis": excess_kurt + 3.0,
"excess_kurt": excess_kurt,
}
def zipfian_expected_log(a: float, n: int) -> float:
ks, pmf = zipfian_pmf_array(a, n)
return float(np.sum(pmf * np.log(ks)))
def zipfian_entropy(a: float, n: int, *, base=math.e) -> float:
ks, pmf = zipfian_pmf_array(a, n)
mask = pmf > 0
H_nats = -np.sum(pmf[mask] * np.log(pmf[mask]))
if base == math.e:
return float(H_nats)
return float(H_nats / math.log(base))
def zipfian_log_mgf(t: float, a: float, n: int) -> float:
a, n = _validate_a_n(a, n)
ks = np.arange(1, n + 1, dtype=float)
logZ = zipfian_logZ(a, n)
logp = -a * np.log(ks) - logZ
return _logsumexp_np(logp + t * ks)
def zipfian_mgf(t: float, a: float, n: int) -> float:
return float(np.exp(zipfian_log_mgf(t, a, n)))
def zipfian_cf(t: float, a: float, n: int) -> complex:
a, n = _validate_a_n(a, n)
ks = np.arange(1, n + 1, dtype=float)
logZ = zipfian_logZ(a, n)
logp = -a * np.log(ks) - logZ
return np.sum(np.exp(logp) * np.exp(1j * t * ks))
a, n = 1.2, 50
mom = zipfian_moments(a, n)
H_direct = zipfian_entropy(a, n)
H_via_formula = zipfian_logZ(a, n) + a * zipfian_expected_log(a, n)
{
**mom,
"entropy_nats": H_direct,
"entropy_nats_check": float(H_via_formula),
"mgf(t=0.1)": zipfian_mgf(0.1, a, n),
"cf(t=1.0)": zipfian_cf(1.0, a, n),
}
{'mean': 8.483213566153296,
'var': 123.37129111277592,
'skew': 1.891549252553807,
'kurtosis': 5.885075724224906,
'excess_kurt': 2.8850757242249063,
'entropy_nats': 2.85361543255586,
'entropy_nats_check': 2.85361543255586,
'mgf(t=0.1)': 6.882518427186466,
'cf(t=1.0)': (0.03586759328104913+0.31936532991899086j)}
5) Parameter Interpretation#
Exponent \(a\) (tail heaviness)
Smaller \(a\) (closer to 0) makes the distribution flatter (closer to uniform on \(\{1,\dots,n\}\)).
Larger \(a\) concentrates mass on small ranks; the tail decays faster.
Truncation \(n\) (maximum rank)
Increasing \(n\) extends the tail to allow rarer outcomes.
For fixed \(a\), increasing \(n\) increases the normalizer \(H_{n,a}\), so the probabilities of low ranks decrease slightly.
A common diagnostic is a log–log plot of PMF vs rank: power laws appear approximately linear with slope \(-a\) (up to truncation effects).
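For the theoretical PMF the log–log relationship is exactly linear, so a least-squares fit recovers the slope \(-a\) up to floating point; empirical data will deviate from this line due to noise and model mismatch. A standalone sketch:

```python
import numpy as np

a, n = 1.6, 200
k = np.arange(1, n + 1, dtype=float)
pmf = k ** -a
pmf /= pmf.sum()
# log p_k = -a log k - log H_{n,a} is linear in log k,
# so a degree-1 fit recovers slope = -a (up to floating point)
slope, intercept = np.polyfit(np.log(k), np.log(pmf), 1)
print(slope)
```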
n_fixed = 80
a_values = [0.6, 1.0, 1.6, 2.5]
n_values = [20, 80, 300]
a_fixed = 1.2
fig = make_subplots(
rows=1,
cols=2,
subplot_titles=(
f"PMF vs a (n={n_fixed})",
f"PMF vs n (a={a_fixed})",
),
)
# Left: vary a
ks = np.arange(1, n_fixed + 1)
for a_ in a_values:
fig.add_trace(
go.Scatter(
x=ks,
y=zipfian_pmf(ks, a_, n_fixed),
mode="markers+lines",
name=f"a={a_}",
),
row=1,
col=1,
)
# Right: vary n
for n_ in n_values:
ks_ = np.arange(1, n_ + 1)
fig.add_trace(
go.Scatter(
x=ks_,
y=zipfian_pmf(ks_, a_fixed, n_),
mode="markers+lines",
name=f"n={n_}",
),
row=1,
col=2,
)
fig.update_xaxes(title_text="rank k", type="log", row=1, col=1)
fig.update_yaxes(title_text="P(X=k)", type="log", row=1, col=1)
fig.update_xaxes(title_text="rank k", type="log", row=1, col=2)
fig.update_yaxes(title_text="P(X=k)", type="log", row=1, col=2)
fig.update_layout(title="Zipfian PMF: shape changes", legend_title_text="parameter")
fig.show()
6) Derivations#
Expectation#
\begin{aligned} \mathbb{E}[X] &= \sum_{k=1}^{n} k\,\Pr(X=k) = \sum_{k=1}^{n} k\,\frac{k^{-a}}{H_{n,a}} \\ &= \frac{\sum_{k=1}^{n} k^{1-a}}{H_{n,a}} = \frac{H_{n,a-1}}{H_{n,a}}. \end{aligned}
Variance#
\begin{aligned} \mathrm{Var}(X) &= \mathbb{E}[X^2]-\mathbb{E}[X]^2 = \frac{H_{n,a-2}}{H_{n,a}} - \left(\frac{H_{n,a-1}}{H_{n,a}}\right)^2. \end{aligned}
Likelihood (i.i.d. data)#
Let \(x_1,\dots,x_m\) be i.i.d. with \(x_i\in\{1,\dots,n\}\). The log-likelihood for \(a\) (with \(n\) treated as known) is
\begin{aligned} \ell(a) &= \sum_{i=1}^{m}\log\Pr(X=x_i\mid a,n) \\ &= \sum_{i=1}^{m}\left(-a\log x_i-\log H_{n,a}\right) \\ &= -a\sum_{i=1}^{m}\log x_i - m\log H_{n,a}. \end{aligned}
If \(n\) is unknown but the support is truncated, we must have \(n\ge\max_i x_i\). For any fixed \(a>0\), \(H_{n,a}\) is increasing in \(n\), which makes the likelihood decrease in \(n\). Hence the MLE for \(n\) is often the boundary value
\[\hat n=\max_i x_i.\]
(Practically, \(n\) is usually known from context: vocabulary size, catalog size, maximum rank considered, etc.)
def _validate_sample(data, n: int) -> np.ndarray:
if isinstance(n, bool) or not isinstance(n, (int, np.integer)):
raise TypeError("n must be an integer")
n = int(n)
x = np.asarray(data)
if x.ndim != 1:
x = x.ravel()
if np.issubdtype(x.dtype, np.integer):
x_int = x.astype(int)
else:
if np.any(np.floor(x) != x):
raise ValueError("data must be integer-valued")
x_int = x.astype(int)
if np.any(x_int < 1) or np.any(x_int > n):
raise ValueError(f"all observations must be in {{1,...,{n}}}")
return x_int
def zipfian_loglik(a: float, data: np.ndarray, n: int) -> float:
a, n = _validate_a_n(a, n)
x = _validate_sample(data, n)
m = x.size
return float(-a * np.sum(np.log(x)) - m * zipfian_logZ(a, n))
def zipfian_mle_a(data: np.ndarray, n: int, *, bounds=(1e-3, 10.0)):
# MLE for a with fixed n via 1D optimization.
_, n = _validate_a_n(1.0, n) # validate n
x = _validate_sample(data, n)
def nll(a):
if a <= 0:
return np.inf
return -zipfian_loglik(a, x, n)
res = minimize_scalar(nll, bounds=bounds, method="bounded")
return float(res.x), res
# Demo: generate data and recover a
true_a, true_n = 1.35, 60
x = stats.zipfian.rvs(true_a, true_n, size=20_000, random_state=rng)
# If n is unknown, a common MLE choice is n_hat = max(x)
n_hat = int(x.max())
a_hat, opt_res = zipfian_mle_a(x, n_hat)
{
"true_a": true_a,
"true_n": true_n,
"n_hat": n_hat,
"a_hat": a_hat,
"optimizer_success": opt_res.success,
}
{'true_a': 1.35,
'true_n': 60,
'n_hat': 60,
'a_hat': 1.3436470243677288,
'optimizer_success': True}
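The MLE for \(a\) also has a moment-matching characterization: since \(\frac{d}{da}\log H_{n,a}=-\mathbb{E}_a[\log X]\), setting \(\ell'(a)=0\) gives \(\mathbb{E}_{\hat a}[\log X]=\frac{1}{m}\sum_i\log x_i\). A standalone sketch solving this score equation by root finding (the bracket endpoints are arbitrary choices):

```python
import numpy as np
from scipy import stats
from scipy.optimize import brentq

def expected_log(a, n):
    # E_a[log X] under Zipfian(a, n); decreasing in a, so the root is unique
    k = np.arange(1, n + 1, dtype=float)
    w = k ** -a
    return float(np.sum(w * np.log(k)) / np.sum(w))

rng = np.random.default_rng(0)
n = 60
x = stats.zipfian.rvs(1.35, n, size=20_000, random_state=rng)
target = np.mean(np.log(x))
# Score equation: E_a[log X] = sample mean of log x_i
a_hat = brentq(lambda a: expected_log(a, n) - target, 1e-3, 10.0)
print(a_hat)
```

This agrees with the bounded 1D optimization above up to numerical tolerance.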
7) Sampling & Simulation (NumPy-only)#
Because the support is finite, the most direct NumPy-only sampler uses inverse transform sampling:
Compute the PMF \(p_k\) on \(k=1,\dots,n\).
Compute the CDF \(c_k = \sum_{j\le k} p_j\).
Draw \(U\sim\mathrm{Uniform}(0,1)\).
Return the smallest \(k\) such that \(c_k\ge U\).
This is:
\(\mathcal{O}(n)\) to build the CDF (once per parameter setting)
\(\mathcal{O}(\log n)\) per sample using np.searchsorted
For very large \(n\) with repeated sampling, an alias method can reduce per-sample cost to \(\mathcal{O}(1)\) after an \(\mathcal{O}(n)\) preprocessing step.
def sample_zipfian_numpy(a: float, n: int, size, rng: np.random.Generator) -> np.ndarray:
a, n = _validate_a_n(a, n)
if isinstance(size, tuple):
out_shape = size
size_int = int(np.prod(size))
else:
size_int = int(size)
out_shape = (size_int,)
if size_int < 0:
raise ValueError("size must be non-negative")
ks, pmf = zipfian_pmf_array(a, n)
cdf = np.cumsum(pmf)
cdf[-1] = 1.0 # guard against floating drift
u = rng.random(size_int)
samples = ks[np.searchsorted(cdf, u, side="left")]
return samples.reshape(out_shape)
a, n = 1.2, 80
samples = sample_zipfian_numpy(a, n, size=100_000, rng=rng)
mom = zipfian_moments(a, n)
{
"theory_mean": mom["mean"],
"mc_mean": float(samples.mean()),
"theory_var": mom["var"],
"mc_var": float(samples.var(ddof=0)),
}
{'theory_mean': 11.700596007114799,
'mc_mean': 11.6452,
'theory_var': 289.1532440210312,
'mc_var': 286.57043695999994}
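The alias method mentioned above can be sketched as follows (Vose’s construction; a standalone illustration, not part of the helpers above):

```python
import numpy as np

def build_alias(p):
    # Vose's alias method: O(n) preprocessing, then O(1) per sample
    p = np.asarray(p, dtype=float)
    n = p.size
    prob = np.empty(n)
    alias = np.zeros(n, dtype=int)
    scaled = p * n
    small = [i for i in range(n) if scaled[i] < 1.0]
    large = [i for i in range(n) if scaled[i] >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s] = scaled[s]
        alias[s] = l
        # Donate the excess of column l to fill column s
        scaled[l] = scaled[l] + scaled[s] - 1.0
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:  # leftovers are full columns (up to fp drift)
        prob[i] = 1.0
    return prob, alias

def sample_alias(prob, alias, size, rng):
    # Each draw: pick a column uniformly, then a biased coin flip
    cols = rng.integers(0, prob.size, size=size)
    flips = rng.random(size=size)
    return np.where(flips < prob[cols], cols, alias[cols]) + 1  # ranks 1..n

rng = np.random.default_rng(0)
a, n = 1.2, 80
k = np.arange(1, n + 1, dtype=float)
pmf = k ** -a
pmf /= pmf.sum()
prob, alias = build_alias(pmf)
s = sample_alias(prob, alias, 100_000, rng)
print(s.mean())  # should be near the theoretical mean ~11.70 from above
```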
8) Visualization#
We’ll visualize:
the PMF (often most informative on a log–log scale)
the CDF (step function)
Monte Carlo samples: empirical frequencies vs theoretical PMF
a, n = 1.25, 100
ks, pmf = zipfian_pmf_array(a, n)
# CDF values on the support
cdf = np.cumsum(pmf)
# Monte Carlo frequencies
m = 60_000
samps = sample_zipfian_numpy(a, n, size=m, rng=rng)
counts = np.bincount(samps, minlength=n + 1)[1:]
emp_pmf = counts / counts.sum()
fig = make_subplots(
rows=1,
cols=3,
subplot_titles=(
"PMF (log–log)",
"CDF",
"Empirical vs theoretical PMF",
),
)
# PMF
fig.add_trace(
go.Scatter(x=ks, y=pmf, mode="markers+lines", name="theory"),
row=1,
col=1,
)
fig.update_xaxes(type="log", title_text="rank k", row=1, col=1)
fig.update_yaxes(type="log", title_text="P(X=k)", row=1, col=1)
# CDF
fig.add_trace(
go.Scatter(x=ks, y=cdf, mode="lines", name="CDF"),
row=1,
col=2,
)
fig.update_xaxes(title_text="k", row=1, col=2)
fig.update_yaxes(title_text="P(X≤k)", row=1, col=2)
# Empirical vs theoretical
fig.add_trace(
go.Scatter(
x=ks,
y=emp_pmf,
mode="markers",
name="empirical",
opacity=0.6,
),
row=1,
col=3,
)
fig.add_trace(
go.Scatter(
x=ks,
y=pmf,
mode="lines",
name="theory",
),
row=1,
col=3,
)
fig.update_xaxes(type="log", title_text="rank k", row=1, col=3)
fig.update_yaxes(type="log", title_text="probability", row=1, col=3)
fig.update_layout(title=f"Zipfian(a={a}, n={n})")
fig.show()
9) SciPy Integration#
SciPy provides the Zipfian distribution as scipy.stats.zipfian.
Common methods:
PMF / CDF: stats.zipfian.pmf, stats.zipfian.cdf
Sampling: stats.zipfian.rvs
Fitting: scipy.stats.fit(stats.zipfian, data, ...)
Notes:
zipfian(a, n) is finite-support on \(\{1,\dots,n\}\).
zipf(a) is the infinite-support zeta distribution (requires \(a>1\)).
a, n = 1.3, 50
dist = stats.zipfian(a, n) # frozen distribution
print("support:", dist.support())
print("pmf([1,2,10]):", dist.pmf([1, 2, 10]))
print("cdf([1,2,10]):", dist.cdf([1, 2, 10]))
rvs = dist.rvs(size=10, random_state=rng)
print("rvs:", rvs)
# Moments from SciPy
mean_s, var_s, skew_s, kurt_s = stats.zipfian.stats(a, n, moments="mvsk")
print("SciPy mean/var/skew/kurtosis:", mean_s, var_s, skew_s, kurt_s)
# Fit (MLE by default)
data = stats.zipfian.rvs(1.25, 80, size=15_000, random_state=rng)
fit_res = stats.fit(stats.zipfian, data, bounds={"a": (0.05, 10.0), "n": (2, 500)})
print(fit_res)
print("a_hat, n_hat:", fit_res.params.a, fit_res.params.n)
support: (1, 50)
pmf([1,2,10]): [0.344329 0.139841 0.017257]
cdf([1,2,10]): [0.344329 0.48417 0.787082]
rvs: [44 7 23 1 1 9 1 2 5 1]
SciPy mean/var/skew/kurtosis: 7.347698178395121 105.22290111165175 2.1749257170027665 4.278884110189842
params: FitParams(a=1.2397758165626769, n=80.0, loc=0.0)
success: True
message: 'Optimization terminated successfully.'
a_hat, n_hat: 1.2397758165626769 80.0
10) Statistical Use Cases#
A) Hypothesis testing (goodness-of-fit)#
A common question: does a Zipfian model explain these ranks?
A simple approach:
Fit \(a\) (and possibly \(n\)).
Compare observed counts to expected counts under the fitted model.
Use a chi-square test (or a likelihood-ratio / G-test).
Because parameters are estimated from the same data, classical p-values can be optimistic; a more careful approach uses a parametric bootstrap.
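A minimal parametric-bootstrap version of that check (a standalone sketch; the grid-search refit and the replicate count are illustrative choices, not part of the code above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def fit_a_grid(x, n, grid=np.linspace(0.05, 5.0, 400)):
    # Crude grid-search MLE for a with n fixed (illustration only)
    k = np.arange(1, n + 1, dtype=float)
    logZ = np.log((k[None, :] ** -grid[:, None]).sum(axis=1))
    ll = -grid * np.sum(np.log(x)) - x.size * logZ
    return float(grid[np.argmax(ll)])

def chi2_stat(x, a, n):
    # Pearson chi-square statistic against the fitted PMF
    obs = np.bincount(x, minlength=n + 1)[1:]
    k = np.arange(1, n + 1, dtype=float)
    p = k ** -a
    p /= p.sum()
    exp = x.size * p
    return float(np.sum((obs - exp) ** 2 / exp))

n, m = 25, 5_000
x = stats.zipfian.rvs(1.15, n, size=m, random_state=rng)
a_hat = fit_a_grid(x, n)
t_obs = chi2_stat(x, a_hat, n)
# Parametric bootstrap: simulate from the fitted model, REFIT each replicate,
# and compare the observed statistic to the bootstrap distribution
t_boot = np.array([
    chi2_stat(xb, fit_a_grid(xb, n), n)
    for xb in (stats.zipfian.rvs(a_hat, n, size=m, random_state=rng)
               for _ in range(200))
])
p_boot = float(np.mean(t_boot >= t_obs))
print("a_hat:", a_hat, "bootstrap p-value:", p_boot)
```

Refitting on each replicate is what accounts for the estimated parameter; comparing \(t_{\text{obs}}\) to a fixed chi-square distribution does not.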
B) Bayesian modeling#
Treat \(a\) as unknown with a prior, e.g. \(a\sim\mathrm{Gamma}(\alpha_0,\beta_0)\). Then the posterior is
\[p(a\mid x_{1:m})\propto p(a)\,\exp\!\left(-a\sum_{i=1}^{m}\log x_i - m\log H_{n,a}\right).\]
For 1D \(a\), a grid posterior is straightforward.
C) Generative modeling#
Zipfian sampling is a standard building block for synthetic datasets with:
head–tail imbalance
realistic rank-frequency behavior
Examples include generating synthetic word IDs, item popularity, or request patterns.
# A) Chi-square goodness-of-fit demo
true_a, true_n = 1.15, 25
x = stats.zipfian.rvs(true_a, true_n, size=30_000, random_state=rng)
# Fit a with n fixed (here: assume n known)
a_hat, _ = zipfian_mle_a(x, true_n, bounds=(0.05, 5.0))
obs = np.bincount(x, minlength=true_n + 1)[1:]
ks, pmf_hat = zipfian_pmf_array(a_hat, true_n)
exp = x.size * pmf_hat
chi2_stat, p_value = stats.chisquare(f_obs=obs, f_exp=exp)
{
"true_a": true_a,
"a_hat": a_hat,
"chi2": float(chi2_stat),
"p_value_naive": float(p_value),
}
{'true_a': 1.15,
'a_hat': 1.151112295598441,
'chi2': 23.181215720032025,
'p_value_naive': 0.5091224145363746}
# B) Bayesian modeling of a via a grid posterior (n fixed)
x = stats.zipfian.rvs(1.3, 60, size=8_000, random_state=rng)
n = 60
# Prior: Gamma(alpha0, rate=beta0)
alpha0, beta0 = 2.0, 1.0
a_grid = np.linspace(0.2, 4.0, 500)
log_prior = stats.gamma.logpdf(a_grid, a=alpha0, scale=1.0 / beta0)
log_like = np.array([zipfian_loglik(a, x, n) for a in a_grid])
log_post_unnorm = log_prior + log_like
# Normalize on the grid
logZ_post = _logsumexp_np(log_post_unnorm)
post = np.exp(log_post_unnorm - logZ_post)
post = post / post.sum()
post_mean = float(np.sum(a_grid * post))
post_cdf = np.cumsum(post)
ci_low = float(a_grid[np.searchsorted(post_cdf, 0.05)])
ci_high = float(a_grid[np.searchsorted(post_cdf, 0.95)])
fig = go.Figure(
data=[go.Scatter(x=a_grid, y=post, mode="lines", name="posterior")],
layout=dict(
title=f"Posterior over a (n={n}, Gamma({alpha0},{beta0}) prior)",
xaxis_title="a",
yaxis_title="density (grid-normalized)",
),
)
fig.add_vline(x=post_mean, line_dash="dash", line_color="black")
fig.add_vrect(x0=ci_low, x1=ci_high, fillcolor="gray", opacity=0.2, line_width=0)
fig.show()
{"posterior_mean": post_mean, "90%_credible_interval": (ci_low, ci_high)}
{'posterior_mean': 1.295206090499907,
'90%_credible_interval': (1.2813627254509017, 1.311823647294589)}
# C) Generative modeling: synthetic rank-frequency data
a, n_vocab = 1.1, 2_000
n_tokens = 120_000
ranks = sample_zipfian_numpy(a, n_vocab, size=n_tokens, rng=rng)
counts = np.bincount(ranks, minlength=n_vocab + 1)[1:]
freq = counts / counts.sum()
rank = np.arange(1, n_vocab + 1)
# Theoretical PMF for comparison
_, pmf_theory = zipfian_pmf_array(a, n_vocab)
fig = go.Figure()
fig.add_trace(go.Scatter(x=rank, y=freq, mode="markers", name="empirical", opacity=0.6))
fig.add_trace(go.Scatter(x=rank, y=pmf_theory, mode="lines", name="theory"))
fig.update_xaxes(type="log", title_text="rank")
fig.update_yaxes(type="log", title_text="frequency")
fig.update_layout(title="Synthetic rank-frequency (log–log)")
fig.show()
11) Pitfalls#
Invalid parameters
SciPy’s zipfian expects a > 0 and integer n >= 1.
If you treat \(n\) as unknown from i.i.d. data, the likelihood typically pushes to \(\hat n = \max x_i\) (a boundary estimate). In practice, choose \(n\) from domain knowledge.
Numerical issues
For large \(n\) or extreme \(a\), naive computations of \(k^{-a}\) can underflow/overflow. Using log-space (as in zipfian_logpmf) is more stable.
Building the full PMF/CDF is \(\mathcal{O}(n)\) memory/time. For very large \(n\) and repeated sampling, consider alias sampling or SciPy’s implementation.
Model mismatch
Many empirical “Zipf-like” datasets deviate in the head or tail (mixtures, cutoffs, measurement bias). Always inspect residuals / goodness-of-fit.
12) Summary#
zipfian(a, n) is a discrete truncated power law on ranks \(\{1,\dots,n\}\) with PMF \(p_k \propto k^{-a}\). The normalizer is the generalized harmonic number \(H_{n,a}\).
Finite support implies all moments exist, with compact expressions via harmonic numbers.
Inverse-CDF sampling gives a simple NumPy-only sampler.
SciPy’s scipy.stats.zipfian provides PMF/CDF/RVS and MLE-style fitting via scipy.stats.fit.